Let $S$ be partitioned into $r \times s$ disjoint sets $E_i$ and $F_j$, where the general
subset is denoted $E_i \cap F_j$. Then the marginal probability of $E_i$ is
The distribution of a variable is a description of the relative numbers of
times each possible outcome will occur in a number of trials. The function
describing the distribution is called the probability function, and the function
describing the cumulative probability that a given value or any value smaller
than it will occur is called the distribution function.

Formally, a distribution can be defined as a normalized measure, and the
distribution of a random variable $x$ is the measure $P_x$ on $S'$ defined by
setting

$$P_x(A') = P\{s \in S : x(s) \in A'\},$$

where $(S, \mathcal{S}, P)$ is a probability space, $(S', \mathcal{S}')$ is a measurable
space, and $P$ is a measure on $\mathcal{S}$ with $P(S) = 1$. If the measure is a Radon
measure (which is usually the case), then the statistical distribution is also
a distribution in the sense of a generalized function.
Abstract

This paper deals with different speaker adaptation methods
for speech recognition systems that adapt automatically to new
and unknown speakers in a short training phase. The adaptation
techniques aim at transformations of feature vectors,
optimised with respect to some constraints. Two different
adaptation strategies are considered. The first one is based
on least mean squared error (MSE) optimization. The
second method is a codebook-driven feature transformation.
Both adaptation techniques are incorporated into two different
recognition systems: dynamic time warping (DTW)
and Hidden Markov Modelling (HMM). The results show
that in both systems speaker-adaptive error rates are close to
speaker-dependent error rates. In the best case the mean error
rate of four test speakers decreases by a factor of 6 (DTW-recognizer)
and 3 (HMM-recognizer), respectively, compared to the
inter-speaker error rate without adaptation. Finally a hardware
realisation of the speaker-adaptive HMM-recognizer is
described.

1 Introduction 

Fast speaker adaptation is of increasing importance for
speech recognition systems with large vocabulary. The traditional
way to train a system to each user's voice (speaker-dependent
system) is to utter each vocabulary word once or
several times. This procedure is no longer acceptable with
increasing vocabulary size. Furthermore, recognition schemes
such as HMM or Neural Networks need a large amount of training
data to optimize the classification parameters.
Hence new methods are applied to adapt the system to a new
speaker. The recognition system is pretrained using training
data of one or several reference speakers. This primary training
effort may be very high. The system is then adapted
to a new unknown user, who has to utter only a few words or
phrases. Two strategies have been investigated during the last
years: adaptation of the pretrained classification parameters
[1-3] and adaptation by transformation of the feature vectors
[4-7]. We pursue the second strategy, called spectral mapping,
where the problem is to find a suitable transformation.
In section 2 several optimization methods are described.
These methods are tested with two different classification
schemes: DTW and HMM. Experiments and results are
shown in section 3. Finally a description of a hardware realisation
is given in section 4.



2 Feature Vector Transformations 

We have investigated five methods in order to determine
a suitable transformation. Four of them use transformation
matrices optimised with respect to the MSE criterion. The
fifth method is an adaptation strategy based on vector quantization.

2.1 Common MSE-Optimization

Let us find a transformation $x = A^T X$, where $X$ is a feature
vector of the new speaker, $A$ is a transformation matrix and
$x$ is the transformed vector. To estimate $A$ we minimize the
mean-squared error

$$D = E\left[(Y - A^T X)^T (Y - A^T X)\right] \tag{1}$$

between a feature vector $Y$ of the reference speaker and the
corresponding transformed vector $A^T X$ of the new speaker.
The solution of this minimization problem is given by

$$A = \left(E[XX^T]\right)^{-1} E[XY^T]. \tag{2}$$

The expected values $E[XX^T]$ and $E[XY^T]$ are estimated
using corresponding vectors $X$ and $Y$ of the same words uttered
by the new and reference speaker in a short training phase.
In order to get a proper time alignment we use DTW without
slope constraint, i.e. each reference utterance $\{Y\}$ is time-warped
against each corresponding utterance $\{X\}$ of the new
speaker, yielding a set of corresponding vectors $\{X(i), Y(i)\}$,
where $i$ are frame indices. This procedure is used identically
with all methods described in this paper.
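The alignment step can be illustrated with a minimal DTW sketch. It is not the authors' implementation: the function name `dtw_align`, the Euclidean local distance and the three unweighted step directions are assumptions made for illustration; the text only states that no slope constraint is used.

```python
import numpy as np

def dtw_align(X, Y):
    """Return the list of frame index pairs (i, j) on the optimal DTW path.

    X: (n, K) frames of the new speaker, Y: (m, K) frames of the reference.
    Allowed steps are (1, 0), (0, 1) and (1, 1); no slope constraint is applied.
    """
    n, m = len(X), len(Y)
    # Local distance between every pair of frames (Euclidean, an assumption).
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    acc = np.full((n, m), np.inf)
    acc[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = np.inf
            if i > 0:
                best = min(best, acc[i - 1, j])
            if j > 0:
                best = min(best, acc[i, j - 1])
            if i > 0 and j > 0:
                best = min(best, acc[i - 1, j - 1])
            acc[i, j] = d[i, j] + best
    # Backtrack from the last frame pair to the first one.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1, j], i - 1, j))
        if j > 0:
            candidates.append((acc[i, j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    return path[::-1]
```

The returned pairs select the corresponding vectors, e.g. `X[[i for i, _ in path]]` and `Y[[j for _, j in path]]`, which then serve as training data for the transformations.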
The optimization strategy works well for mel-frequency
coefficients (MFC), but it fails for mel-cepstral coefficients
(MCC), i.e. the recognition rate decreases if the transformation
is applied. Evidently additional constraints are necessary
to find an optimal transformation for MCCs.
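As a minimal sketch of the closed-form solution (2), the expected values can be replaced by sample averages over the aligned frame pairs. The function names and the row-wise data layout (one feature vector per row) are assumptions made for illustration.

```python
import numpy as np

def estimate_mse_matrix(X, Y):
    """Estimate A = (E[X X^T])^{-1} E[X Y^T] from aligned frames.

    X, Y: arrays of shape (n, K), one aligned feature vector per row.
    """
    n = X.shape[0]
    Rxx = X.T @ X / n   # sample estimate of E[X X^T]
    Rxy = X.T @ Y / n   # sample estimate of E[X Y^T]
    # Solve Rxx A = Rxy instead of forming the inverse explicitly.
    return np.linalg.solve(Rxx, Rxy)

def transform(A, X):
    """Apply x = A^T X to every row of X."""
    return X @ A
```

Solving the linear system rather than inverting $E[XX^T]$ is a numerical design choice of this sketch, not something the text prescribes.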



2.2 MSE-Optimization with constraints

This method is called MSE_C. We introduce the constraint
that the variance of each component $x_k$ of the
transformed vector $x$ should be equal to the variance of the
corresponding component of the reference vector $Y$:

$$E[x_k^2] = E[Y_k^2]. \tag{3}$$

With (1) we get a new optimization criterion, written for each
component $k$:

$$D_k = E[x_k^2] - 2E[x_k Y_k] + E[Y_k^2] \to \min. \tag{4}$$

Under constraint (3) the first and third terms are fixed, so the
error component $D_k$ is minimized if the correlation

$$E[x_k Y_k] = a_k^T E[X Y_k] \to \max \tag{5}$$

between the transformed vector component $x_k = a_k^T X$ and
the reference vector component $Y_k$ is maximised, where $a_k$ is
a transformation vector. To solve this maximization problem
under the constraint (3) we use the Lagrange method and
obtain

$$a_k = \frac{1}{\lambda_k} \left(E[XX^T]\right)^{-1} E[X Y_k] \tag{6}$$

with the Lagrange parameter

$$\lambda_k^2 = \frac{1}{E[Y_k^2]} \, E[X Y_k]^T \left(E[XX^T]\right)^{-1} E[X Y_k]. \tag{7}$$

Eq. (6) corresponds, with the exception of the factor $1/\lambda_k$,
to the solution (2) in the unconstrained case. Finally the
desired matrix $A$ is built up with the vector $a_k$ as the k-th
column vector.
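The constrained solution can be sketched as follows, again with sample averages in place of expected values. The function name and the row-wise data layout are assumptions; the positive root of $\lambda_k^2$ is taken because it is the one that maximises the correlation.

```python
import numpy as np

def estimate_mse_c_matrix(X, Y):
    """Estimate the variance-constrained matrix A, column by column.

    X, Y: arrays of shape (n, K) with aligned, mean-free feature vectors.
    Column k of the result is a_k = (1 / lambda_k) (E[X X^T])^{-1} E[X Y_k].
    """
    n = X.shape[0]
    Rxx = X.T @ X / n                  # E[X X^T]
    Rxy = X.T @ Y / n                  # column k holds E[X Y_k]
    B = np.linalg.solve(Rxx, Rxy)      # column k holds (E[X X^T])^{-1} E[X Y_k]
    var_y = np.mean(Y ** 2, axis=0)    # E[Y_k^2]
    # lambda_k^2 = E[X Y_k]^T (E[X X^T])^{-1} E[X Y_k] / E[Y_k^2]
    lam = np.sqrt(np.sum(Rxy * B, axis=0) / var_y)
    return B / lam                     # divide column k by lambda_k
```

After the transformation `X @ A`, each component has the same second moment as the corresponding reference component, which is the constraint the method imposes.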



2.3 Transformation into a joint feature space

This method (called GRE) is based on a technique for
spectral transformation proposed by Grenier et al. [5]. The
idea is to transform the feature vectors $X$ of the new speaker
as well as the vectors $Y$ of the reference speaker into a joint
feature space by means of linear transformations $x = P_L X$,
$y = P_R Y$. The transformation matrices $P_L$ and $P_R$ have to
be determined in such a way that the Euclidean distance

$$D = E\left[(P_L X - P_R Y)^T (P_L X - P_R Y)\right] \tag{8}$$

between the transformed vectors $x$ and $y$ is minimum. Since
the trivial but unsatisfactory solution $P_R = P_L = 0$ fulfils
this criterion, we introduce the additional constraint

$$E[x_k^2] = E[y_k^2] = 1, \tag{9}$$

i.e. we require unit variance for the components of both
transformed vectors. With this normalisation the minimization
problem can now be formulated for each component
according to

$$D_k = E\left[(x_k - y_k)^2\right] = 2\left(1 - E[x_k y_k]\right) \to \min. \tag{10}$$

Thus $D_k$ is minimum if the components $x_k$ and $y_k$ of the
target vectors $x$ and $y$ are maximally correlated. The solution
can be found by applying canonical correlation analysis
[7,8]. This leads to a generalised eigenproblem, which can be
solved by employing techniques known from the singular value
decomposition [9] of a matrix. This matrix contains the
auto- and cross-covariance matrices of the two speakers, which
are estimated using corresponding feature vectors in a short
training phase.
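One common way to carry out canonical correlation analysis is to whiten both feature sets and take the singular value decomposition of the whitened cross-covariance matrix. The sketch below follows that route; it is an illustration under the assumption of mean-free, row-wise data and full-rank covariance matrices, and the function names are hypothetical rather than taken from [5].

```python
import numpy as np

def inv_sqrt(C):
    """Inverse symmetric square root of a positive definite matrix."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def cca_transforms(X, Y):
    """Return (P_L, P_R, s) such that x = P_L X and y = P_R Y have unit
    variance per component and E[x_k y_k] = s[k] is maximal.

    X, Y: arrays of shape (n, K) with aligned, mean-free feature vectors.
    """
    n = X.shape[0]
    Cxx = X.T @ X / n     # auto-covariance of the new speaker
    Cyy = Y.T @ Y / n     # auto-covariance of the reference speaker
    Cxy = X.T @ Y / n     # cross-covariance
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # SVD of the whitened cross-covariance gives the canonical directions.
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    P_L = U.T @ Wx
    P_R = Vt @ Wy
    return P_L, P_R, s
```

With row-wise data the transformed vectors are obtained as `X @ P_L.T` and `Y @ P_R.T`; the singular values `s` are the canonical correlations, sorted in descending order.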

A modified version (GRE_1T) of this method is to combine
the two matrices $P_L$ and $P_R$ into a single one in order to
avoid computations for the speaker adaptation during application,
i.e. only the feature vectors $Y$ of the reference speaker
are transformed, once, after the short training phase:

$$\hat{y} = P_L^T P_R \, Y(i) = \left(P_R^T P_L\right)^T Y(i) \tag{11}$$

2.4 Nonlinearly extended feature vectors

The performance of an adaptation procedure with linearly
transformed vectors is limited due to nonlinear
dependencies. On the other hand, optimization problems
based on linear transformations can be solved in closed
form. We can combine these two considerations by applying the
linear transformations to nonlinearly extended feature vectors
(GRE_Q). A primary feature vector $v = (v_1, v_2, \ldots, v_K)$
is extended here to a polynomial vector of second order by
forming quadratic combinations of the components:
$v_Q = (v_1, v_2, \ldots, v_K, v_1^2, v_1 v_2, \ldots, v_K^2)$.
This extension is performed
for the test as well as for the reference templates. Concerning
the calculation of the transformation matrices we can proceed
in the same way as described above.

A combination of the two matrices into one is possible, too.
This method is called GRE_Q_1T.
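The quadratic extension can be sketched in a few lines. The function name is hypothetical, and the ordering of the product terms (all pairs with index i not larger than j, row by row) is an assumption consistent with the listing of $v_Q$ in the text.

```python
import numpy as np

def extend_quadratic(v):
    """Extend v = (v_1, ..., v_K) by all products v_i * v_j with i <= j."""
    v = np.asarray(v, dtype=float)
    i, j = np.triu_indices(len(v))   # index pairs (0,0), (0,1), ..., (K-1,K-1)
    return np.concatenate([v, v[i] * v[j]])
```

For a vector with K components the extended vector has K + K(K+1)/2 components, i.e. 65 for K = 10.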



2.5 Transformation by use of a codebook

The idea is to use a quantised feature space in order to get
a suitable transformation. This is done by means of a codebook.
Thus any nonlinear transformation can be realised.
Each feature vector of the reference speaker is mapped into
the quantised feature space through vector quantisation, yielding
a codebook symbol $S_m$, and is then replaced by a new feature
vector which is related to this Voronoi cell $S_m$ (note: the new
feature vector is not the centroid of the cell). This new feature
vector has been created by a linear combination of feature
vectors of the new speaker (the method may be applied vice
versa, too). For computing the linear combinations we use a
codebook. After investigating several variations, the following
procedure (method CB, fig. 1) was implemented.

Figure 1: Implementation of the CB adaptation strategy

Training phase:

• Corresponding vectors X(i) and Y(i) are determined
using DTW.

• Each reference vector Y(i) is uniquely mapped into a
codebook symbol $S_m$ using vector quantization.

• For each codebook symbol $S_m$ we now compute the
mean vector c(m) of all vectors X(i) whose corresponding
vectors Y(i) are mapped into $S_m$. This mean vector
can be computed recursively during the training phase.

• At the end of the training phase each reference vector
Y(i) is replaced by the linear combination c(m) which
corresponds to its code symbol $S_m$.

In the recognition phase the vectors X(i) of the new speaker
remain unchanged. This adaptation strategy is similar to the
one proposed by Shikano et al. [4].
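The training steps listed above can be sketched as follows. This is an illustration, not the authors' code: the function names, the Euclidean nearest-centroid quantizer and the handling of cells that receive no training vector (the reference vector is left unchanged) are assumptions.

```python
import numpy as np

def quantize(vectors, codebook):
    """Map each vector to the index of its nearest centroid (Euclidean)."""
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(d, axis=1)

def train_cb_mapping(X, Y, codebook):
    """Compute the mean vector c(m) of all X(i) whose Y(i) falls into cell m.

    X, Y: aligned frames of shape (n, K); codebook: centroids of shape (M, K).
    """
    M, K = codebook.shape[0], X.shape[1]
    c = np.zeros((M, K))
    counts = np.zeros(M, dtype=int)
    for x, m in zip(X, quantize(Y, codebook)):
        counts[m] += 1
        c[m] += (x - c[m]) / counts[m]   # recursive update of the mean
    return c, counts

def adapt_reference(Y, codebook, c, counts):
    """Replace each reference vector by the mean vector of its cell."""
    symbols = quantize(Y, codebook)
    # Cells never seen in training keep the original vector (an assumption).
    seen = counts[symbols][:, None] > 0
    return np.where(seen, c[symbols], Y)
```

The recursive update avoids storing all training vectors per cell, which matches the remark that the mean can be computed during the training phase.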
3 Experiments and Results

We used two different recognition schemes in order to test
the performance of the various adaptation methods: a DTW-
and an HMM-recognizer.

Test conditions: The signal was low-pass filtered and
sampled with 12 kHz sampling frequency. Then mel-cepstral
coefficients (MCC) were computed every 10 msec. We used
K = 10 MCCs per feature vector. Speaker-dependent mean
values of the MCCs were subtracted. The test vocabulary
consisted of 100 common German words, spoken by 4 male
(A, B, C, D) and 1 female (E) speaker. Two sets of 100
words/speaker (S1 and S2) were recorded on two different
days. Furthermore a database of about 15 min. of speech,
spoken by speaker A, was established in order to design a
codebook and to train the HMMs. Therefore speaker A was
defined as reference speaker; the remaining 4 speakers formed
the test set.

The quality of the adaptation can be evaluated by comparing
the speaker-dependent error rate (SD), the speaker-adaptive
error rate and the error rate without any adaptation
(WA). The SD was measured using S1 and S2 of a speaker as
reference and test set, respectively (DTW-recognizer). WA is
obtained by classifying S2 of all speakers except reference
speaker A. The adaptation of speaker B to A, for example, is
performed using S1 of both speakers to compute the
transformation matrices. The performance of the adaptation is
checked by classifying S2 of each test speaker after
transformation of the corresponding speech samples.

DTW-recognizer: The DTW-system was an isolated
word recognizer based on the city-block distance measure, which
was not optimised for extraordinarily high performance.
Fig. 2 and table 1 show the results for all test speakers, each
of them adapted to reference speaker A.


Figure 2: Mean error rates [%] of the different adaptation
methods for the DTW-recognizer and the HMM-recognizer;
100 words in training phase (1.5 min. of speech)

The mean error rate decreases from 42.7% without adaptation
to 23% (MSE_C), 19.2% (GRE_1T), 11.5% (GRE), 8%
(CB) and 6.7% (GRE_Q). For method CB we used a speaker-dependent
codebook (size 256). The result of method CB
with a speaker-independent codebook (10 extraneous speakers,
900 word vocabulary, size 256) was 10%.
The best results have been achieved by applying GRE_Q. The
error rate is a factor of 6 below WA and is within the range
of the SD (5%). It is worth noting that the best results are
obtained by using only the first 10 transformed components
of the quadratically extended feature vectors for further
classification.



Speaker adaptation should aim at a short training phase.
Therefore the number of samples necessary for optimizing the
transformation matrices is another criterion for evaluating the
performance of the adaptation procedure. Thus we used the
first n (10 ≤ n ≤ 100) templates of the training sets. From
fig. 3 it becomes obvious that for the GRE method 20 words
are sufficient to train the required parameters. The other
methods, however, need about 40 words for convergence.
A further reduction of the training phase with respect to the
new speaker is possible if we store several repetitions of the
reference speaker's utterances. Tested with the CB method, we
obtained about the same results for a training vocabulary spoken
once by the new and the reference speaker as for a training
vocabulary of half the size spoken once by the new speaker and
5 times by the reference speaker.

Figure 3: Mean error rate vs. number of training
templates; DTW-recognizer

The influence of the reference speaker was investigated in a
further experiment, i.e. we used speaker B instead of speaker
A as the reference speaker. The results differ only slightly
from those above for all methods. Therefore we conclude that
the choice of the reference speaker is not a critical parameter for
the adaptation methods. However, further investigations will
be necessary to confirm these results.

Another aspect was the choice of the adaptation vocabulary.
Some experiments showed that the training phase
can be minimised if the training vocabulary is phonetically
balanced. If not, the training phase must be longer to get
optimal results.

HMM-recognizer: Our HMM recognition system is described
in detail in [6]. Words are represented by a series of
HMMs of subword units. The phonetic graphs, whose nodes
are the HMMs, are automatically generated by rules from
the standard orthographic descriptions of words. These
descriptions are stored in the lexicon. The phonetic description
of a word is a graph because common alternate pronunciations
are taken into account using these rules.
HMMs are described by continuous transition and discrete
emission probabilities. Therefore vector quantization (VQ) is
necessary. VQ is carried out by means of a speaker-dependent
codebook (size 128). Furthermore we use speaker-dependent
models of subword units.

To apply the proposed adaptation methods we have to incorporate
feature vector transformations in the HMM-system.
For this purpose transformation matrices are computed in the
same manner as described above.

For the application we have to distinguish between one-sided
and two-sided adaptations (transformation of one or both
speakers' vectors). The one-sided methods MSE_C, GRE_1T,
GRE_Q_1T and CB are directly applicable to transform the
feature vectors of the new speaker. In contrast, the two-sided
methods GRE and GRE_Q transform the vectors of the
reference speaker, too. In a real application we do not have
application vocabulary feature vectors of the reference speaker,
i.e. he is represented by the codebook, and the HMMs are trained
using this codebook. Therefore the conventional but expensive
way for speaker adaptation would be the following:
transformation of all training material (reference speaker's
utterances); codebook generation with transformed data;
training the HMMs with transformed data. This procedure can
be shortened by transforming the codebook centroids instead
of the reference speaker's vectors. This is equivalent to a
transformation of the quantized reference speaker's space. Hence the
one-sided adaptation methods are applicable in the same manner,
i.e. transforming the codebook centroids instead of the
new speaker's feature vectors to adapt the reference speaker
(the system) to the new speaker.

Table 1: Error rates [%] of all adaptation methods; ref. speaker A, test speakers B-E; training at 100 words.
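The shortcut of transforming the codebook centroids can be sketched as below. The sketch assumes a one-sided transformation matrix `A` that has already been estimated so that it maps reference-speaker vectors towards the new speaker; the function names and the Euclidean quantizer are hypothetical choices for illustration.

```python
import numpy as np

def adapt_codebook(centroids, A):
    """Apply the transformation c' = A^T c to every centroid (one per row)."""
    return centroids @ A

def quantize(vectors, codebook):
    """Map each vector to the index of its nearest centroid (Euclidean)."""
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(d, axis=1)

# Usage with made-up numbers: the new speaker's vectors stay unchanged and
# are quantized with the adapted codebook; the HMMs are not retrained.
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
A = 2.0 * np.eye(2)
adapted = adapt_codebook(centroids, A)
symbols = quantize(np.array([[0.9, 0.9]]), adapted)
```

Only the centroids are touched, so the cost of adaptation is one matrix product over the codebook rather than a pass over all training material.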

The results of the speaker-adaptive HMM-system are shown
in figure 2 and table 1.

In principle the DTW- and the HMM-recognizer behave in the
same way. Method GRE_Q is again the best, with
a mean error rate of 7.2%. This is a factor of 3 below WA.
We obtained a significant reduction of the mean error rate for
all adaptation methods with the exception of the CB method.
Although it gives excellent results with the DTW-recognizer,
with the HMM-recognizer it is only slightly better than WA.
We will try to find out the reason in our future work.

4 Realization of the speaker-adaptive HMM-recognizer

The algorithms of the speaker-adaptive HMM-recognizer
are implemented on a PC-based demonstration system. Data
acquisition (MCC computation) as well as preprocessing are
realised on a commercially available PC board based on the
TMS 320C25.

A second board is based on the floating-point signal processor
DSP32C from AT&T and is used for performing speaker
adaptation and classification. A floating-point processor was
chosen because of the algorithmic complexity. The feature vectors
are transformed and quantized using the codebook of the
reference speaker. Finally the HMM algorithm based on subword
models determines the N best word candidates, the labels of
which are transmitted to the host.

The DSP32C board is equipped with a memory extension
board of 4 MByte. At the moment the system is designed
to operate with an active vocabulary of about 1000 isolated
words. The concept, however, is kept flexible. Therefore a
larger vocabulary as well as a continuous speech recognition
option can be integrated easily.



5 Conclusions 

We have investigated several speaker adaptation methods
based on feature vector transformations. Most of the proposed
methods use only one transformation matrix (one-sided
adaptation), i.e. they can be organized in such a way that the
reference speaker's vectors are transformed once after the training
phase and therefore no computations are necessary for
adaptation in the application phase. Two methods are two-sided,
i.e. they transform both the new and the reference speaker's vectors.
The adaptation procedure consists of two steps: a) computing
a transformation matrix (or matrices) automatically using
a few utterances (20 - 40) spoken by the new speaker in a short
training phase, and b) transforming the feature vectors. The
methods are invariant with respect to vocabulary and
classification scheme because they are based on unlabeled feature
vectors. The experiments indicate that all methods result in
significant improvements, some of which lie in the range of
the speaker-dependent error rate. In the best case the mean
error rate decreases by a factor of 6 (DTW-recognizer) and
3 (HMM-recognizer), respectively, compared to the inter-speaker
error rate without adaptation.